Structural Regular Expressions
Author
Abstract
The current UNIX® text processing tools are weakened by the built-in concept of a line. There is a simple notation that can describe the 'shape' of files when the typical array-of-lines picture is inadequate. That notation is regular expressions. Using regular expressions to describe the structure in addition to the contents of files has interesting applications, and yields elegant methods for dealing with some problems the current tools handle clumsily. When operations using these expressions are composed, the result is reminiscent of shell pipelines.

The Peter-On-Silicon Problem

In the traditional model, UNIX text files are arrays of lines, and all the familiar tools (grep, sort, awk, etc.) expect arrays of lines as input. The output of ls (regardless of options) is a list of files, one per line, that may be selected by tools such as grep:

    ls -l /usr/ken/bin | grep 'rws.*root'

(I assume that the reader is familiar with the UNIX tools.) The model is powerful, but it is also pervasive, sometimes overly so. Many UNIX programs would be more general, and more useful, if they could be applied to arbitrarily structured input. For example, diff could in principle report differences at the C function level instead of the line level. But if the interesting quantum of information isn't a line, most of the tools (including diff) don't help, or at best do poorly. Worse, perverting the solution so the line-oriented tools can implement it often obscures the original problem.

To see how a line-oriented view of text can introduce complication, consider the problem of turning Peter into silicon.
The input is an array of blank and non-blank characters, like this:

    ####### ######### #### ##### #### #### # #### ##### #### ### ######## ##### #### ######### #### # # #### ## # ### ## ### # ### ### ## ## # # #### # # ## # ##

The output is to be statements in a language for laying out integrated circuits:

    rect minx miny maxx maxy

The statements encode where the non-blank characters are in the input. To simplify the problem slightly, the coordinate system has x positive to the right and y positive down. The output need not be efficient in its use of rectangles.

Awk is the obvious language for the task, which is a mixture of text processing and geometry, hence arithmetic. Since the input is an array of lines, as awk expects, the job should be fairly
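The task described above can be sketched in a few lines of Python. This is an illustrative, line-oriented sketch and not the author's program: the function name `rects` is hypothetical, and the code assumes zero-based coordinates, treats each character as a unit square, and emits one rectangle per horizontal run of non-blank characters (the text allows inefficient use of rectangles).

```python
import re


def rects(text):
    """Return 'rect minx miny maxx maxy' statements for the non-blank runs in text.

    Assumptions (not from the source): coordinates start at 0, each character
    occupies a unit square, and maxx/maxy are exclusive upper bounds.
    """
    statements = []
    # y grows downward: the first line of the input is row 0.
    for y, line in enumerate(text.splitlines()):
        # x grows to the right: each run of non-blank characters becomes
        # a rectangle one row tall.
        for run in re.finditer(r"\S+", line):
            statements.append(f"rect {run.start()} {y} {run.end()} {y + 1}")
    return statements


if __name__ == "__main__":
    sample = " ##\n# #"
    for statement in rects(sample):
        print(statement)
```

For the two-line sample, the sketch emits one rectangle for the run on the first row and two for the second row. Note that the loop walks the input line by line, which is exactly the array-of-lines view the text is about to question.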
Similar resources
Structural Join Algorithm for Sequential Regular Path Expressions
XML queries employ regular path expressions to find structural patterns within XML documents. The operation of structural join is a crucial part of XML query processing. Existing approaches reduce complex join expressions to several binary structural joins. It implies generation of superfluous intermediate data. In this paper, we propose a new structural join algorithm, called sequence join alg...
Obtaining shorter regular expressions from finite-state automata
We consider the use of state elimination to construct shorter regular expressions from finite-state automata (FAs). Although state elimination is an intuitive method for computing regular expressions from FAs, the resulting regular expressions are often very long and complicated. We examine the minimization of FAs to obtain shorter expressions first. Then, we introduce vertical chopping based o...
Shorter Regular Expressions from Finite-State Automata
We consider the use of state elimination to construct shorter regular expressions from finite-state automata. Although state elimination is an intuitive method for computing regular expressions from finite-state automata, the resulting regular expressions are often very long and complicated. We examine the minimization of finite-state automata to obtain shorter expressions first. Then, we introd...
Verified Decision Procedures for MSO on Words
Monadic second-order logic on finite words (MSO) is a decidable yet expressive logic into which many decision problems can be encoded. Since MSO formulas correspond to regular languages, equivalence of MSO formulas can be reduced to the equivalence of some regular structures (e.g. automata). This paper presents a verified functional decision procedure for MSO formulas that is not based on autom...
Graphs Encoded by Regular Expressions
In the conversion of finite automata to regular expressions, an exponential blowup in size can generally not be avoided. This is due to graph-structural properties of automata which cannot be directly encoded by regular expressions and cause the blowup combinatorially. In order to identify these structures, we generalize the class of arc-series-parallel digraphs to the acyclic case. The resulti...
Compile-Time Path Expansion in Lore
Semistructured data is usually modeled as labeled directed graphs, and query languages are based on declarative path expressions that specify traversals through the graphs. Regular (or generalized) path expressions use regular expression operators to specify traversal patterns. Regular path expressions typically are evaluated at run-time by exploring the database graph. However, if the database...
Publication date: 2006